Example 2: Identification of Novel Human Aspartyl Proteases Using Database
Mining by Genome Bridging

Materials and Methods: 
Computer-assisted analysis of EST databases, cDNA, and predicted polypeptide
sequences:

Exhaustive homology searches of EST databases with the CEASP1, F21F8.3,
F21F8.4, and F21F8.7 sequences failed to reveal any novel mammalian homologues.
TBLASTN searches with R12H7.2 showed homology to cathepsin D, cathepsin E,
pepsinogen A, pepsinogen C and renin, particularly around the DTG motif within the active
site, but also failed to identify any additional novel mammalian aspartyl proteases. This
indicates that the C. elegans genome probably contains only a single lysosomal aspartyl
protease which in mammals is represented by a gene family that arose through duplication
and consequent modification of an ancestral gene.

TBLASTN searches with T18H9.2, the remaining C. elegans sequence, identified
several ESTs which assembled into a contig encoding a novel human aspartyl protease
(Hu-ASP1). As is described above in Example 1, a BLASTX search with the Hu-ASP1 contig
against SWISS-PROT revealed that the active site motifs in the sequence aligned with the
active sites of other aspartyl proteases. Exhaustive, repetitive rounds of BLASTN searches
against LifeSeq, LifeSeqFL, and the public EST collections identified 102 ESTs from
multiple cDNA libraries that assembled into a single contig. The 51 sequences in this
contig found in public EST collections also have been assembled into a single contig
(THC213329) by The Institute for Genomic Research (TIGR). The TIGR annotation
indicates that they failed to find any hits in the database for the contig. Note that the TIGR
contig is the reverse complement of the LifeSeq contig that we assembled. A BLASTN search
of Hu-ASP1 against the rat and mouse EST sequences in ZooSeq revealed one homologous
EST in each database (Incyte clone 700311523 and IMAGE clone 313341, GenBank
accession number W10530, respectively).
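The observation that two independently assembled contigs are reverse complements of each other can be checked mechanically. The following is a minimal sketch, not the tooling used in this work; the function names are hypothetical and the sequences in the usage line are toy strings, not the actual contigs.

```python
# Illustrative sketch only: function names are hypothetical and the
# sequences used with them here are toy strings, not the real contigs.

# Watson-Crick complement for unambiguous bases; N is kept as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_reverse_complement(contig_a: str, contig_b: str) -> bool:
    """True if contig_b reads as contig_a on the opposite strand."""
    return contig_a.upper() == reverse_complement(contig_b).upper()
```

For example, `is_reverse_complement("ATGCC", "GGCAT")` evaluates to `True`. Real contigs assembled from different EST sets rarely match end to end, so in practice an alignment of one contig against the reverse complement of the other would be used rather than strict equality.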

TBLASTN searches with the assembled DNA sequence for Hu-ASP1 against both
LifeSeqFL and the public EST databases identified a second, related human sequence
(Hu-Asp2) represented by a single EST (2696295). Translation of this partial cDNA sequence
reveals a single DTG motif which has homology to the active site motif of a bovine aspartyl
protease, NM1.
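Translated searches of this kind compare a protein query against nucleotide sequences translated in all six reading frames. As a rough illustration of that idea (not of BLAST itself, which scores alignments rather than exact matches), the sketch below translates a DNA string in six frames with the standard genetic code and reports which frames contain a DTG tripeptide. All function names are hypothetical and the input in the usage note is a toy sequence.

```python
# Illustrative sketch only: exact-match motif scan over six-frame
# translations. Names are hypothetical; this is not how BLAST works
# internally, and the example sequences are toy data.
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate frame 1 of dna; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


def frames_with_motif(dna: str, motif: str = "DTG") -> list:
    """Return the sorted labels of frames whose translation contains motif."""
    return sorted(
        name for name, protein in six_frame_translations(dna).items()
        if motif in protein
    )
```

With the toy input `"GATACCGGC"` (GAT ACC GGC, i.e. Asp-Thr-Gly), `frames_with_motif` returns `["+1"]`; the reverse complement of that string gives `["-1"]`.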

BLAST searches, contig assemblies and multiple sequence alignments were
performed using the bioinformatics tools provided with the LifeSeq, LifeSeqFL and
LifeSeq Assembled databases from Incyte. Predicted protein motifs were identified using
either the ProSite dictionary (Motifs in GCG 9) or the Pfam database.

Full-length cDNA cloning of Hu-Asp1:

The open reading frame of C. elegans gene T18H9.2CE was used to query Incyte
LifeSeq and LifeSeq-FL databases and a single electronic assembly referred to as
1863920CE1 was detected. The 5' most cDNA clone in this contig, 1863920, was obtained
from Incyte and completely sequenced on both strands. Translation of the open reading
frame contained within clone 1863920 revealed the presence of the duplicated aspartyl
protease active site motif (DTG/DSG) but the 5' end was incomplete. The remainder of the
Hu-Asp1 coding sequence was determined by 5' Marathon RACE analysis using a human
placenta Marathon ready cDNA template (Clontech). A 3'-antisense oligonucleotide
primer specific for the 5' end of clone 1863920 was paired with the 5'-sense primer specific
for the Marathon ready cDNA synthetic adaptor in the PCR. Specific PCR products were
directly sequenced by cycle sequencing and the resulting sequence assembled with the
sequence of clone 1863920 to yield the complete coding sequence of Hu-Asp-1 (SEQ ID
No. 1).

Several interesting features are present in the primary amino acid sequence
of Hu-Asp1 (Figure 1, SEQ ID No. 2). The sequence contains a signal peptide (residues 1-
20 in SEQ ID No. 2), a pro-segment, and a catalytic domain containing two copies of the
aspartyl protease active site motif (DTG/DSG). The spacing between the first and second
active site motifs is about 200 residues which should correspond to the expected size of a
single, eukaryotic aspartyl protease domain. More interestingly, the sequence contains a
predicted transmembrane domain (residues 469-492 in SEQ ID No. 2) near its C-terminus
which suggests that the protease is anchored in the membrane. This feature is not found in
any other aspartyl protease.
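
The motif positions and spacing described above can be located in a predicted polypeptide with a simple pattern scan. The sketch below is an assumption-laden illustration: the function names are hypothetical, spacing is defined here as the difference between the positions of the two catalytic aspartates (the text does not state how its spacing was counted), and the test sequence in the usage note is a toy string rather than SEQ ID No. 2.

```python
# Illustrative sketch only: names are hypothetical, and "spacing" is
# assumed to mean the distance between the two aspartate positions.
import re

# Aspartyl protease active-site tripeptide as written in the text: DTG or DSG.
ACTIVE_SITE = re.compile(r"D[TS]G")


def active_site_positions(protein: str) -> list:
    """Return 1-based positions of the aspartate in each D[TS]G match."""
    return [m.start() + 1 for m in ACTIVE_SITE.finditer(protein.upper())]


def motif_spacing(protein: str):
    """Distance between the first two active-site aspartates, or None."""
    positions = active_site_positions(protein)
    if len(positions) < 2:
        return None
    return positions[1] - positions[0]
```

For the toy sequence `"AAADTG" + "A" * 10 + "DSG"`, the positions are `[4, 17]` and the spacing is `13`. A short tripeptide pattern can also match by chance outside the catalytic sites, so hits in a real sequence would need to be confirmed against a fuller profile such as the ProSite or Pfam entries mentioned earlier.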

Cloning of full-length Hu-Asp-2 cDNAs:

As is described above in Example 1, a genome-wide scan of the Caenorhabditis
elegans database WormPep12 for putative aspartyl proteases and subsequent mining of
human EST databases revealed a human ortholog to the C. elegans gene T18H9.2 referred
to as Hu-Asp1. The assembled contig for Hu-Asp1 was used to query for human paralogs
using the BLAST search tool in human EST databases and a single significant match
(2696295CE1) with approximately 60% shared identity was found in the LifeSeq FL
database. Similar queries of either gb105PubEST or the family of human databases
available from TIGR did not identify similar EST clones. cDNA clone 2696295, identified
by single pass sequence analysis from a human uterus cDNA library, was obtained from
Incyte and completely sequenced on both strands. This clone contained an incomplete 1266
bp open-reading frame that encoded a 422 amino acid polypeptide but lacked an initiator
ATG on the 5' end. Inspection of the predicted sequence revealed the presence of the
duplicated aspartyl protease active site motif DTG/DSG, separated by 194 amino acid
residues. Subsequent queries of later releases of the LifeSeq EST database identified an
additional EST, sequenced from a human astrocyte cDNA library (4386993), that appeared
to contain additional 5' sequence relative to clone 2696295. Clone 4386993 was obtained
from Incyte and completely sequenced on both strands. Comparative analysis of clone
4386993 and clone 2696295 confirmed that clone 4386993 extended the open-reading
frame by 31 amino acid residues including two in-frame translation initiation codons.
Despite the presence of the two in-frame ATGs, no in-frame stop codon was observed
upstream of the ATG, indicating that clone 4386993 may not be full-length. Furthermore,
alignment of the sequences of clones 2696295 and 4386993 revealed a 75 base pair
insertion in clone 2696295 relative to clone 4386993 that results in the insertion of 25
additional amino acid residues in 2696295. The remainder of the Hu-Asp2 coding sequence
was determined by 5' Marathon RACE analysis using a human hippocampus Marathon
ready cDNA template (Clontech). A 3'-antisense oligonucleotide primer specific for the
shared 5'-region of clones 2696295 and 4386993 was paired with the 5'-sense primer
specific for the Marathon ready cDNA synthetic adaptor in the PCR. Specific PCR
products were directly sequenced by cycle sequencing and the resulting sequence assembled
with the sequence of clones 2696295 and 4386993 to yield the complete coding sequence
of Hu-Asp2(a) (SEQ ID No. 3) and Hu-Asp2(b) (SEQ ID No. 5), respectively.
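
The full-length argument above rests on two simple checks: whether an in-frame stop codon lies upstream of a candidate initiator ATG (if none is found, the open reading frame may continue further 5'), and whether nucleotide lengths convert to whole numbers of codons (1266 bp to 422 residues, 75 bp to 25 residues). The sketch below illustrates both; the function names are hypothetical and the sequences in the usage note are toy strings, not the clone sequences.

```python
# Illustrative sketch only: names are hypothetical and the sequences
# used with these helpers here are toy strings, not clone data.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_inframe_stop(cdna: str, atg_index: int) -> bool:
    """Scan 5' of the ATG at atg_index (0-based), codon by codon in frame.

    Returns True if a stop codon is found before the 5' end is reached.
    """
    cdna = cdna.upper()
    if cdna[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    for i in range(atg_index - 3, -1, -3):
        if cdna[i:i + 3] in STOP_CODONS:
            return True
    return False


def codons_in(base_pairs: int) -> int:
    """Number of codons in a stretch of base_pairs; must be a multiple of 3."""
    if base_pairs % 3:
        raise ValueError("length is not a whole number of codons")
    return base_pairs // 3
```

For example, `has_upstream_inframe_stop("TAAGCCATGGCC", 6)` is `True`, whereas `has_upstream_inframe_stop("GCCGCCATGGCC", 6)` is `False`, the situation described for clone 4386993. The length arithmetic in the text is consistent: `codons_in(1266)` gives `422` and `codons_in(75)` gives `25`.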

Several interesting features are present in the primary amino acid sequence of
Hu-Asp2(a) (Figure 2 and SEQ ID No. 4) and Hu-Asp-2(b) (Figure 3, SEQ ID No. 6). Both
sequences contain a signal peptide (residues 1-21 in SEQ ID No. 4 and SEQ ID No. 6), a
pro-segment, and a catalytic domain containing two copies of the aspartyl protease active
site motif (DTG/DSG). The spacing between the first and second active site motifs is
variable due to the 25 amino acid residue deletion in Hu-Asp-2(b) and consists of 168
versus 194 amino acid residues, for Hu-Asp2(b) and Hu-Asp-2(a), respectively. More